fix: Avoid unnecessary type casts in concat_ws #20436

Open
neilconway wants to merge 3 commits into apache:main from neilconway:neilc/concat-ws-type-fixes

@neilconway
Contributor

@neilconway neilconway commented Feb 19, 2026

Which issue does this PR close?

Rationale for this change

  1. concat_ws returned Utf8, regardless of the input types it was called with. If it was called with LargeUtf8, returning Utf8 might overflow, because Utf8 uses 32-bit offsets. In general, functions like these should operate on all three string representations unless there is a compelling reason not to (e.g., this is how concat works).
  2. simplify_concat_ws always constructed new literals with type Utf8. This led to unnecessary casts when its inputs were of a different string type.
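
As a rough illustration of the overflow concern in item 1: Arrow's Utf8 layout addresses its value bytes with i32 offsets, while LargeUtf8 uses i64 offsets, so the total bytes of a Utf8 result must fit within i32::MAX. The sketch below is not DataFusion code; `concat_ws_row_len` and `fits_utf8_offsets` are hypothetical helper names introduced only for this example.

```rust
/// Hypothetical helper: byte length of `concat_ws(sep, values)` for one row,
/// given the lengths of the non-null values. One separator is placed between
/// each pair of adjacent values.
fn concat_ws_row_len(sep_len: u64, value_lens: &[u64]) -> u64 {
    let values: u64 = value_lens.iter().sum();
    let seps = sep_len * (value_lens.len().saturating_sub(1) as u64);
    values + seps
}

/// Hypothetical helper: can an array holding this many value bytes in total
/// be addressed by the i32 offsets of a Utf8 array?
fn fits_utf8_offsets(total_bytes: u64) -> bool {
    total_bytes <= i32::MAX as u64
}
```

For example, `fits_utf8_offsets(3_000_000_000)` is false, which is the situation where a LargeUtf8 input could not be returned as Utf8.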

What changes are included in this PR?

  • Support concat_ws return type matching its input types, following how concat does it.
  • In simplify_concat_ws, construct literals with the right type, not always Utf8.
  • Refactor return_type for concat to be more readable
  • Make the StringViewArrayBuilder API more similar to the other string array builders with respect to null handling
  • Add new unit and SLT tests
  • Update test output for changed types
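
A minimal sketch of the "construct literals with the right type" idea, under stated assumptions: `StrType`, `StrLiteral`, `typed_literal`, and `merge_literals` are hypothetical stand-ins invented for this example, not DataFusion's actual types or the code in this PR. The point is only that a literal folded from adjacent constant arguments keeps the string type of its inputs, so no cast has to be added afterwards.

```rust
/// Hypothetical stand-in for the three Arrow string types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Hypothetical stand-in for a typed string literal.
#[derive(Debug, Clone, PartialEq, Eq)]
enum StrLiteral {
    Utf8(String),
    LargeUtf8(String),
    Utf8View(String),
}

/// Build a literal whose type matches `ty` instead of always using Utf8.
fn typed_literal(value: String, ty: StrType) -> StrLiteral {
    match ty {
        StrType::Utf8 => StrLiteral::Utf8(value),
        StrType::LargeUtf8 => StrLiteral::LargeUtf8(value),
        StrType::Utf8View => StrLiteral::Utf8View(value),
    }
}

/// Fold adjacent constant arguments into one literal of the input type.
fn merge_literals(sep: &str, parts: &[&str], ty: StrType) -> StrLiteral {
    typed_literal(parts.join(sep), ty)
}
```

With this shape, merging `"a"` and `"b"` with separator `","` for LargeUtf8 inputs yields a LargeUtf8 literal rather than a Utf8 literal that would need casting.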

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes: some queries involving concat_ws will now omit unnecessary cast operations, and the return type of concat_ws might be any of the three string types. Generally these changes should match user expectations better than the previous behavior.

@github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels on Feb 19, 2026
@Omega359
Contributor

I did a quick look at the changes and nothing obvious jumped out at me. I'll try and find time to do a more extensive review if no one else beats me to it.

@neilconway
Contributor Author

@Omega359 Thank you!

@Omega359
Contributor

🤖 /home/bruce/gh_compare_branch_bench.sh Benchmark Script Running
Linux fedora 6.18.12-200.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Feb 16 18:58:26 UTC 2026 x86_64 GNU/Linux
Comparing neilc/concat-ws-type-fixes (9c0b4f4) to c3f0807 diff
BENCH_NAME=concat_ws
BENCH_COMMAND=cargo bench --bench concat_ws
BENCH_FILTER=
BENCH_BRANCH_NAME=neilc_concat-ws-type-fixes
Results will be posted here when complete

@Omega359
Contributor

🤖: Benchmark completed

Details

group                                  main                                   neilc_concat-ws-type-fixes
-----                                  ----                                   --------------------------
concat_ws function/concat_ws/1024      1.00     11.1±0.08µs        ? ?/sec    1.16     12.9±2.43µs        ? ?/sec
concat_ws function/concat_ws/4096      1.00     44.1±0.88µs        ? ?/sec    1.02     44.8±1.95µs        ? ?/sec
concat_ws function/concat_ws/8192      1.00     88.5±3.98µs        ? ?/sec    1.02     90.4±3.22µs        ? ?/sec
concat_ws function/concat_ws/scalar    1.00     28.3±0.15µs        ? ?/sec    1.02     28.8±0.41µs        ? ?/sec

builder.append_offset();
continue;
match return_datatype {
DataType::Utf8View => {

Contributor

I wonder if all this duplicated code could be eliminated with an approach similar to

trait StringArrayBuilderType: ArrayBuilder {
?

Contributor Author

Yeah, I think that would make sense to do. I'm inclined to do it as a follow-up PR -- let me know if you'd prefer it as part of this PR.

Contributor

I think that is fine.

@Omega359
Contributor

LGTM. @Jefffrey, @alamb I believe this is ready for final review and approval.

Ok(dt.to_owned())
if arg_types.contains(&Utf8View) {
Ok(Utf8View)
} else if arg_types.contains(&LargeUtf8) {

Contributor

I had a thought about this. I think LargeUtf8 should take precedence over Utf8View because you cannot necessarily fit data from a LargeUtf8 column into Utf8View (i64 vs i32). See https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.LargeUtf8

Contributor Author

Interesting point. I believe the typical precedence today is Utf8View > LargeUtf8 > Utf8, partly on the grounds that "StringArray to StringViewArray is cheap but not vice versa". I can see arguments for both sides; if we want to reconsider this, seems like a distinct issue?

/// Coercion rules for string view types (Utf8/LargeUtf8/Utf8View):
/// If at least one argument is a string view, we coerce to string view
/// based on the observation that StringArray to StringViewArray is cheap but not vice versa.
///
/// Between Utf8 and LargeUtf8, we coerce to LargeUtf8.
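
The precedence described in the quoted rule (Utf8View > LargeUtf8 > Utf8) can be sketched as below. This is an illustration only: `StringType` and `coerced_string_type` are hypothetical names defined here so the example is self-contained, and returning Utf8 for an empty argument list is an assumption of the sketch, not something the thread specifies.

```rust
/// Hypothetical stand-in for the three Arrow string types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StringType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Pick a common string type for a set of arguments:
/// any Utf8View wins, otherwise any LargeUtf8 wins, otherwise Utf8.
fn coerced_string_type(arg_types: &[StringType]) -> StringType {
    if arg_types.contains(&StringType::Utf8View) {
        StringType::Utf8View
    } else if arg_types.contains(&StringType::LargeUtf8) {
        StringType::LargeUtf8
    } else {
        // Assumption: fall back to Utf8, including for an empty slice.
        StringType::Utf8
    }
}
```

Reversing the first two branches would give the alternative ordering proposed in the review (LargeUtf8 ahead of Utf8View).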

Contributor

Likely it should be another issue as it likely occurs in a few places. I am fairly certain I am correct on the proper type ordering here but in the wild I doubt it would be encountered much - just how many columns would have > 2 billion bytes?

Contributor

Also, this is really only pertinent in areas that pick a return type based on multiple columns. For the typical case where the udf is operating on a single column the existing logic should be fine - such as in btrim

Contributor

I think LargeUtf8 should take precedence over Utf8View because you cannot necessarily fit data from a LargeUtf8 column into Utf8View (i64 vs i32)

I think the only type of data that a LargeUtf8 can handle but that can't be stored in a Utf8View is individual strings that are longer than 2GB

Otherwise, data from a LargeUtf8 will work just fine in Utf8View (the view will have multiple buffers rather than one large one)

Contributor

I think LargeUtf8 should take precedence over Utf8View because you cannot necessarily fit data from a LargeUtf8 column into Utf8View (i64 vs i32)

I think the only type of data that a LargeUtf8 can handle but that can't be stored in a Utf8View is individual strings that are longer than 2GB

Otherwise, data from a LargeUtf8 will work just fine in Utf8View (the view will have multiple buffers rather than one large one)

Indeed, that was the point I was trying to get across. It's rare, but possible. Though honestly I expect DF would fail somewhere else pretty quickly if a column with data that big was ever encountered.

@neilconway
Contributor Author

@alamb This is ready to be reviewed and/or merged, I think.

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Development

Successfully merging this pull request may close these issues.

concat_ws does unnecessary type casts

3 participants